IMT573 Problem Set 5: Bayes theorem, distributions Your name: Kulraj Singh kohli Deadline: Wed, Nov 6th 3pm Instructions 1. You are welcome to answer some of the questions on paper but please include the result as an image in your final file (a cellphone photo will do). 2. Please write clearly. Imagine this is a business report for your boss. She does not care about coding but is very much interested in the results. Can she understand your text? Test it, by making all code invisible and ensuring one can still understand what you have written.

  1. Do not add irrelevant output! Every output you produce must be there for a good reason!

1 Overbooking flights

You are hired by Air Nowhere to recommend the optimal overbooking rate. It is a small airline that uses a 100-seat plane to carry you from Seattle to, well, nowhere. The tickets cost $100 each, so a fully booked plane generates $10,000 revenue. The sales team has found that the probability, that the passengers who have paid their fare actually show up is 98%, and individual show-ups can be considered independent. The additional costs, associated with finding an alternative solutions for passengers who are refused boarding are $500 per person.

  1. Which distribution would you use to describe the actual number of show-ups for the flight?

Hint: read OIS ch 3 about distributions.

Ans1.Binomial

  1. Assume the airline never overbooks. What is it’s expected revenue?

Expected revenue is 10,000

  1. Now assume the airline sells 101 tickets for 100 seats. What is the probability that all 101 passengers will show up? Ans3.
#binomial for the question
prob_101<-dbinom(101,101,0.98)
prob_101
## [1] 0.1299672

Probability is 12.9% 4. What are the expected profits (= revenue − expected additional costs) in this case? Would you recommend overbooking over selling just the right number of tickets?

#expected profits = revenue − expected additional costs
expected_profits<-10100-(prob_101)*500*1

Expected Profit is 10035.02 As you can see that overbooking is still going to generate higher expected profits hence overbooking is recommended.

  1. Now assume the airline sells 102 tickets. What is the probability that all 102 passengers show up?
#binomial for the question
prob_102<-dbinom(102,102,0.98)

The probability that all 102 passengers show up is 12.7% 6. What is the probability that 101 passengers still one too many will show up?

#binomial for the question
dbinom(101,102,0.98)
## [1] 0.265133

Probability that 101 passengers show up is 26.5%

  1. Would it be advisable to sell 102 tickets, i.e. is the expected revenue from selling 102 tickets larger than from selling 100 and 101 tickets?
#expected profits = revenue − expected additional costs
expected_profit_102<-10200-(2*dbinom(102,102,0.98)+dbinom(101,102,0.98))*500
expected_profit_102
## [1] 9940.066

We see that the expected profit is $9940.066 which is less than the expected profit of 10035 in the case of 101 tickets.

  1. What is the optimal number of seats to sell for the airline? How big are the expected profits?
#Writing a function to calculate the expected revenue till 108 and analysing trends
n<-101
 final<-c()
while (n<109) {
  x<-0
  sum<-10000+(n-100)*100
  expected_loss<-0
  while (x<(n-100)) {
    expected_loss<-expected_loss+((n-100)-x)*dbinom(n-x,n,0.98)
    x<-x+1
  }
  tot_expected_loss<-expected_loss*500
 
   
  final[n-100]<-sum-tot_expected_loss
  
  n<-n+1
}

final
## [1] 10035.016  9940.066  9713.848  9398.369  9036.920  8656.348  8269.083
## [8]  7879.791

We see from the above trend that the most optimal scenario is when the airline sells 101 tickets to customers.

  1. What does it mean that the show-ups are independent? Why is it important?

One passenger coming does not affect whether the other passenger is coming or not,means that they are independent events.

It is important because while calculating probability if the events were interdependent we cant use binomial distribution in that case.

Hint: read about independence in OIS 2.1.6 (2017 version). Hint: some of the expressions may be hard to write analytically. Feel free to use computer for the calculations, just show the code and explain what you are doing.

2 Does the student know the answer?

In the exam, there is a multiple-choice question with four (mutually exclusive) options. In average, 80% of the students know the answer, but event those who know, still answer it wrong in 10% of time because of the exam stress.

  1. If a student get’s the answer right, what is the probability that she actually knows the material? Hint: read OIS 2.2, in particular 2.2.7 (2017 version) Ans1.
ANSWER 2

ANSWER 2

2 3 Histogram and normal distribution

In this problem set you are comparing the humans in terms of body size (height) and influence. Strictly speaking we do not have data on human influence here, just research paper influence (citations) but it is a good proxy for influence of the individual humans (researchers).

Let’s start with the human height data.

citations<-read.csv("mag-in-citations.csv")
head(citations)
##   paper_id num_in_citations
## 1  4090687                2
## 2  6537979                2
## 3  7484482                4
## 4  9444380                3
## 5 14056478                5
## 6 14498457                2
human_height<-read.delim("fatherson.csv.bz2")
head(human_height)
##   fheight sheight
## 1   165.2   151.8
## 2   160.7   160.6
## 3   165.0   160.9
## 4   167.0   159.5
## 5   155.3   163.3
## 6   160.1   163.2
  1. You’ll work about human heights. What kind of measure is this? (nominal, ordered, difference, ratio)? How should it be measured (continuous, discrete, positive…)?

We see from the data that fheight and sheight have a true 0 value which is also well defined, hence it is a ratio.

It should be measured as continuous and positive.

  1. Read the fatherson.csv data. It contains two columns, father’s height and son’s height, (in cm). Let’s focus on fathers here (variable fheight) and ignore the sons. Provide the basic descriptives: how many observations do we have? Do we have any missings?
length(human_height$fheight)
## [1] 1078
table(is.na(human_height$fheight))
## 
## FALSE 
##  1078

We see from the above results that we have 1078 entries and none of them are missing.

  1. Compute mean, median, mode, standard deviation and range of the heights. Discuss the relationship between these numbers. Is mean larger than median? Than mode? By how much (in relative terms)? How does standard deviation compare to mean?

Hint: there is no built-in method to estimate mode in R. Several packages provide a way to do it, I recommend to use modeest::mlv (installedon the server).

However, as height is a continuous variable, there are many ways to compute it. Take a look at the corresponding documentation. You may experiment with a few options and pick one, for instance naive.

summary(human_height$fheight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   149.9   167.1   172.1   171.9   176.8   191.6
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(human_height$fheight)
## [1] 175.4
sd(human_height$fheight)
## [1] 6.972346

The range of the human heights is 49 (149.9-191.9) , Std Deviation is 6.97 , mode is 175.4 , median is 172.1, mean is 171.9. (I assume these values to be in cm)

Mean is the same as median whereas mode is greater than mean.

The std deviation is less, means that most of our data lies around our mean.

  1. plot a histogram of the data. Add to this histogram
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
plot1<-human_height %>% ggplot(aes(x = fheight)) + geom_histogram(binwidth =0.5,color="black",fill="white", alpha = 0.4)
plot1

  1. plot of normal distribution with the same mean and standard deviation as the data.
plot2<-human_height %>% ggplot(aes(x = fheight)) + geom_histogram(aes(y=..density..),binwidth=0.5,color="black",fill="white",alpha =0.6)+stat_function(fun = dnorm ,color = "red",args=list(mean=171.9, sd= 6.97))+ scale_x_continuous(name = "Distribution of Height ")+ggtitle("Plot of Normal distribution with same Mean and Std Deviation vs Density plot of actual data")

plot2

  1. mean, median, and mode. You can use vertical lines of dierent color.
plot2 + geom_density(color="black") + geom_vline(aes(xintercept=mean(human_height$fheight)),linetype="dashed",color="magenta1")+ geom_text(aes(x=mean(human_height$fheight), label="Mean", y=0.01,), colour="magenta1",alpha = 0.01, angle=90, vjust =-0.5, text=element_text(size=50)) + geom_vline(aes(xintercept=median(human_height$fheight)),linetype="dashed",color="purple1")+ geom_text(aes(x=median(human_height$fheight), label="Meidan", y=0.03,), colour="purple1",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20)) +
geom_vline(aes(xintercept=getmode(human_height$fheight)),linetype="dashed",color="midnightblue")+ geom_text(aes(x=getmode(human_height$fheight), label="Mode", y=0.03,), colour="midnightblue",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20)) 
## Warning: Ignoring unknown parameters: text

## Warning: Ignoring unknown parameters: text

## Warning: Ignoring unknown parameters: text

What do you find? Are the histogram and the density plot similar?

We see from the graph the red line represents our normal distribution and the black line is the density plot for the given data of heights. They are pretty close to each other and hence we can say they are similar.

Next, let’s take a look at human influence. We are working with the number of citations of research papers.

  1. What kind of measure is this? What kind of valid figures would you expect to see (continuous, discrete, positive, …)
head(citations)
##   paper_id num_in_citations
## 1  4090687                2
## 2  6537979                2
## 3  7484482                4
## 4  9444380                3
## 5 14056478                5
## 6 14498457                2

We see that the citations column are discrete positive values and the have a well defined 0 hence are a ratio measure.

  1. Read the mag-in-citations.csv data. This is Microsoft Academic Graph for citations of research papers, and it contains two columns: paper id and number of citations. We only care about citations here. Provide the basic descriptives: how many observations do we have? Do we have any missings?
length(citations$num_in_citations)
## [1] 388258
table(is.na(citations$num_in_citations))
## 
##  FALSE 
## 388258

We see that there are 388258 citations and none of them are missings

  1. Compute mean, median, mode, standard deviation and range of the heights. Discuss the relationship between these numbers. Is mean larger than median? Than mode? By how much (in relative terms)? How does standard deviation compare to mean?
summary(citations$num_in_citations)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     1.00     3.00    15.61    12.00 18682.00
getmode(citations$num_in_citations)
## [1] 0
sd(citations$num_in_citations)
## [1] 78.39079

The range of the citations is 18682 (149.9-191.9cm) , Std Deviation is 78.390 , mode is 0 , median is 3, mean is 15.61.

Mean is not the same as median and median is smaller than mean whereas mode is lesser than median.

The std deviation is a large number, means that most of our data does not lie around our mean.

  1. plot a histogram of the data. Add to this histogram
citplot1<-citations %>% ggplot(aes(x = num_in_citations)) + geom_histogram(color="black",fill="white", alpha = 0.4)+ scale_x_log10()+scale_y_log10()
citplot1
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 84550 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).

  1. plot of normal distribution with the same mean and standard deviation as the data.
citplot2<-ggplot(citations, aes(x = num_in_citations)) + geom_histogram(fill = "magenta3",alpha=0.2, color= "black") + scale_x_log10() + scale_y_log10() + ggtitle("Histogram of number of citations with Normal distribution")+ geom_density(aes(y=0.2*..count..), colour="black", adjust=4)

citplot2
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 84550 rows containing non-finite values (stat_bin).
## Warning: Removed 84550 rows containing non-finite values (stat_density).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).

  1. mean, median, and mode. You can use vertical lines of dierent color.
citplot2 +  geom_vline(aes(xintercept=mean(citations$num_in_citations)),linetype="dashed",color="orange1")+ geom_text(aes(x=mean(citations$num_in_citations), label="Mean", y=10,), colour="orange1",alpha = 0.01, angle=90, vjust =-0.5, text=element_text(size=50)) + geom_vline(aes(xintercept=median(citations$num_in_citations)),linetype="dashed",color="purple1")+ geom_text(aes(x=median(citations$num_in_citations), label="Meidan", y=100,), colour="purple1",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20)) +
geom_vline(aes(xintercept=getmode(citations$num_in_citations)),linetype="dashed",color="midnightblue")+ geom_text(aes(x=getmode(citations$num_in_citations), label="Mode", y=100,), colour="midnightblue",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20)) 
## Warning: Ignoring unknown parameters: text

## Warning: Ignoring unknown parameters: text

## Warning: Ignoring unknown parameters: text
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 84550 rows containing non-finite values (stat_bin).
## Warning: Removed 84550 rows containing non-finite values (stat_density).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).

What do you find? Are the histogram and the density plot similar? Hint: experiment with log-log scale for the histogram. 9. Finally, comment on your findings about human bodies and influence. ->We see that the human height follows a curve similar to that of normal distribution. ->We see that most number of citations are 0 -> We see that the data of citations is right skewed. ->We see that the median and mode of the human data is almost the same,whereas the median is smaller than the mean in citations data. -> We see that humans have an normal distributin of height, which means that the have a common median and mode most of the heights are at this value and the std deviation is less that means the data does not vary too much -> We see that most people have zero or very less citations and that is the reason why its histogram is right skewed, the std deviation is large that means the data is scattered and there are large outliers.